Abstract

We consider the 1D Expected Improvement optimization based on Gaussian processes having spectral densities converging to zero faster than exponentially. We give examples of problems where the optimization trajectory is not dense in the design space. In particular, we prove that for Gaussian kernels there exist smooth objective functions for which the optimization does not converge on the optimum.

1 Introduction

The optimization problem. Consider a global "black-box" optimization problem

$$f(x) \to \min_{x \in \mathcal{D}}. \tag{1}$$

For the moment, suppose that $\mathcal{D}$ is a compact metric space, and $f$ a continuous real-valued function on $\mathcal{D}$, so that the minimum $f^* = \min_{x \in \mathcal{D}} f(x)$ exists. Consider an optimization procedure seeking this minimum. In black-box optimization, such a procedure consists of a sequence of iterations; each iteration suggests for evaluation a new point of the set $\mathcal{D}$ based on the already observed values of the objective function $f$. More precisely, we can say that, algorithmically, optimization is defined by the initial point $x_1 \in \mathcal{D}$ and a family of mappings

$$A_K : (\mathcal{D} \times \mathbb{R})^K \to \mathcal{D}, \quad K = 1, 2, \ldots.$$

The optimization trajectory $\{x_k\}_{k=1}^{\infty}$ is then determined by relations

$$x_{K+1} = A_K\big(\{x_k, f(x_k)\}_{k=1}^{K}\big), \quad K = 1, 2, \ldots. \tag{2}$$

Any practical optimization is terminated at some step $K$, and the approximate minimum $f_K^*$ is then defined by

$$f_K^* := \min_{k=1,\ldots,K} f(x_k).$$

It is then natural to call optimization consistent if

$$\lim_{K \to \infty} f_K^* = f^*.$$

The following proposition is a very simple but important criterion of consistency on the space of continuous functions [16].

Proposition 1. An optimization algorithm defined by mappings $A_K$ is consistent for all $f \in C(\mathcal{D})$ if and only if for any continuous $f$ the trajectory $\{x_k\}_{k=1}^{\infty}$ generated by (2) is dense in $\mathcal{D}$.

The sufficiency is clear; the necessity follows since any continuous function can be modified, preserving its continuity, in any open set so as to make the function attain its optimum in this open set.

In many practical applications, the objective function $f$ is expensive to evaluate, and the mappings $A_K$ can then be quite complex and resource-intensive; in particular they often involve solving auxiliary optimization problems. A popular modern approach to global black-box optimization is stochastic Bayesian optimization where these auxiliary problems are stated using some prior assumptions of probabilistic nature. In this paper we will consider one of the most natural and well-known methods of this type: optimization by Expected Improvement [4-6, 10, 11, 14, 15].

The Expected Improvement algorithm (EI). In this method, we think of the optimized function $f$ as a realization of a stochastic process $(\xi_x)_{x \in \mathcal{D}}$. Assuming the probability measure associated with the process is known, we define the mappings $A_K$ by maximizing the expectation of the improvement in the best known value of the objective function resulting from its additional evaluation, conditioned on the set $\{\xi_{x_k} = f(x_k)\}_{k=1}^{K}$. Precisely, we define

$$A_K\big(\{x_k, f(x_k)\}_{k=1}^{K}\big) = \arg\max_{x \in \mathcal{D}} I_{K;\{x_k, f(x_k)\}_{k=1}^{K}}(x), \tag{3}$$

where

$$I_{K;\{x_k, f(x_k)\}_{k=1}^{K}}(x) = E\big(f_K^* - \min(f_K^*, \xi_x) \,\big|\, \{\xi_{x_k} = f(x_k)\}_{k=1}^{K}\big).$$

In practice, the stochastic process $\xi_x$ is usually Gaussian, which allows one to numerically solve the optimization problem

$$I_{K;\{x_k, f(x_k)\}_{k=1}^{K}}(x) \to \max_{x \in \mathcal{D}} \tag{4}$$

for moderate values of $K$. Namely, assume that $\xi_x$ is a centered Gaussian process with the covariance

$$G(x, y) = E(\xi_x \xi_y).$$

Then $\xi_x$ conditioned on $\{\xi_{x_k} = f(x_k)\}_{k=1}^{K}$ is also a Gaussian random variable:

$$\xi_x \,\big|\, \{\xi_{x_k} = f(x_k)\}_{k=1}^{K} \sim \mathcal{N}\big(m_{x;\{x_k, f(x_k)\}_{k=1}^{K}},\ \sigma^2_{x;\{x_k\}_{k=1}^{K}}\big),$$

where $m$ and $\sigma^2$ denote the conditional mean and variance. Note that since the process is Gaussian, the variance depends on $\{x_k\}_{k=1}^{K}$ but not on $\{f(x_k)\}_{k=1}^{K}$. A straightforward calculation shows that

$$m_{x;\{x_k, f(x_k)\}_{k=1}^{K}} = g_{K,x}^{t}\, G_K^{-1} f_K, \tag{5}$$

$$\sigma^2_{x;\{x_k\}_{k=1}^{K}} = G(x, x) - g_{K,x}^{t}\, G_K^{-1} g_{K,x}, \tag{6}$$

where

$$f_K = (f(x_1), \ldots, f(x_K))^{t},$$
$$g_{K,x} = (G(x, x_1), \ldots, G(x, x_K))^{t},$$
$$G_K = (G(x_k, x_l))_{k,l=1}^{K}.$$

Throughout the paper, we will assume that the kernel $G(x, y)$ is strictly positive definite, which in particular ensures that $G_K$ in (5), (6) is invertible.

If $G$ is continuous, then the conditional mean and variance continuously depend on $x$, which implies existence of the maximum in (4). The maximum can be attained at more than one point; any of them can be taken as $x_{K+1}$. Up to this ambiguity, the EI algorithm is completely determined by the kernel $G$.
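As an illustration of formulas (5) and (6), the following minimal sketch computes the conditional mean and variance in plain double precision. The helper names (`gaussian_kernel`, `solve`, `posterior`), the particular kernel and the sample points are assumptions made for this example only; they are not part of the paper's setup.

```python
import math

def gaussian_kernel(x, y):
    """Illustrative covariance G(x, y) = exp(-(x - y)^2) (an assumed choice)."""
    return math.exp(-(x - y) ** 2)

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z

def posterior(x, xs, fs, kernel=gaussian_kernel):
    """Conditional mean (5) and variance (6) at x given observations (xs, fs)."""
    G_K = [[kernel(a, b) for b in xs] for a in xs]
    g = [kernel(x, a) for a in xs]
    w = solve(G_K, g)  # w = G_K^{-1} g_{K,x}; G_K is symmetric
    mean = sum(wi * fi for wi, fi in zip(w, fs))
    var = kernel(x, x) - sum(wi * gi for wi, gi in zip(w, g))
    return mean, var

xs = [0.0, 0.5, -0.7]
fs = [-gaussian_kernel(x, 0.0) for x in xs]
mean_new, var_new = posterior(0.3, xs, fs)
```

At an already observed point the sketch returns the observed value with (numerically) zero variance, and at a new point the variance is strictly positive. Double precision is adequate only for a few well-separated points; it is not meant for long trajectories.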

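Note that if the kernel $G$ is strictly positive definite, then

$$I_{K;\{x_k, f(x_k)\}_{k=1}^{K}}(x) \begin{cases} = 0, & x \in \{x_k\}_{k=1}^{K}, \\ > 0, & x \notin \{x_k\}_{k=1}^{K}, \end{cases}$$

so that the maximizer $x_{K+1} \notin \{x_k\}_{k=1}^{K}$, i.e., all the points of the trajectory $\{x_k\}_{k=1}^{\infty}$ are different.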

Consider the Hilbert space $L^2(\Omega, P)$, where $(\Omega, P)$ is the probability space on which the process $\xi_x$ is defined. Then one can geometrically interpret the conditional variance $\sigma^2_{x;\{x_k\}_{k=1}^{K}}$ as the squared distance between the vector $\xi_x$ and the linear span of the vectors $\{\xi_{x_k}\}_{k=1}^{K}$ in $L^2(\Omega, P)$.

Practical implementations of the EI algorithm often use somewhat more complex modelling than described above, based on kriging [6]. This approach includes additional polynomial trends in the model; also, the covariance function is assumed to depend on a few parameters which are adjusted at each iteration using cross-validation or maximum likelihood estimates. We will not consider these complications in this paper.

We will fix a kernel $G$ and will treat the EI algorithm described above as ideally implemented with this kernel, in the sense that the auxiliary problem (4) is assumed to be exactly solved at each iteration $K$. We will then be interested in the convergence properties of the resulting sequences $x_K$ and $f(x_K)$.

Previous rigorous results. EI is a popular approach to global optimization in modern engineering applications, but not much has been proved about it rigorously. If $\xi_x$ is the Wiener process or its stationary version, the Ornstein-Uhlenbeck process, on a segment in $\mathbb{R}$, then, using the Markov property, it is not hard to check that the EI optimization is consistent for continuous objective functions, see [9]. In [17, 18], Vazquez and Bect considered the general case of compact subsets of $\mathbb{R}^n$ and proved the convergence of the EI algorithm for sufficiently "rough" stationary processes $\xi_x$, on objective functions $f$ from the reproducing-kernel Hilbert space (RKHS) associated with $\xi_x$.

As an intermediate step in their proof, these authors consider what they call the No-Empty-Ball (NEB) property of the process $\xi$:

Definition 1. The process $\xi_x$ is said to have the NEB property if for all sequences $\{x_k\}_{k=1}^{\infty}$ (not necessarily given by (2)) and all points $x$ in $\mathcal{D}$ the following two conditions are equivalent:

1. $x$ belongs to the closure of $\{x_k\}_{k=1}^{\infty}$;

2. the conditional variance $\sigma^2_{x;\{x_k\}_{k=1}^{K}} \to 0$ as $K \to \infty$.

The first condition clearly implies the second for processes with a continuous covariance function, but the opposite direction is more subtle. In particular, Vazquez and Bect prove that the NEB property is violated by Gaussian processes with a Gaussian covariance function. They show, however, that the NEB property holds for a stationary process provided its spectral density goes to zero sufficiently slowly, namely if its inverse is polynomially bounded. Additionally, they show that if a Gaussian process has the NEB property and the objective function is from the corresponding RKHS, then the optimization trajectory is dense in $\mathcal{D}$, and hence optimization is consistent on this space.

Vazquez and Bect also show that for Gaussian processes with the NEB property the optimization trajectory is dense almost surely, if the optimized function is a realization of the process.

Recently, Bull [1] has obtained rigorous convergence rates for objective functions from the RKHS of the process.

Some rigorous results are also available for certain stochastic optimization algorithms different from but closely related to EI, see, e.g., Gutmann [3].

Finally, though in this article we do not consider covariance functions with adaptively adjusted parameters, we mention that these more general kinds of EI optimization are known to be inconsistent in some cases [1, 9].



2 Results 

As discussed above, the existing rigorous results about convergence of the EI optimization are mostly proofs of convergence under certain assumptions, namely when the NEB property holds and/or the objective function belongs to the RKHS associated with the process. At the same time, little is known rigorously about (in)consistency of the EI optimization when these assumptions are violated, though, for example, the Gaussian kernel is one of the most common kernels used in practical modelling [2], while in engineering applications one rarely expects strong regularity of the objective function.

Vazquez and Bect [17, 18] conjecture consistency for all continuous objective functions provided the process has the NEB property. The result of Locatelli [9] confirms this in the case of the Wiener process.

The goal of this paper is to examine convergence of the EI algorithm for analytic Gaussian processes. More precisely, we will consider kernels with spectral densities which very rapidly converge to 0; this property is related to analyticity by Paley-Wiener-type theorems (see, e.g., [7], page 209). Our main result is a class of examples demonstrating some lack of consistency of the EI optimization in this case, for objective functions which are not analytic. We thus show, in particular, that the EI optimization cannot be fully consistent if both the NEB and RKHS assumptions are dropped.

We will consider only 1D models in this paper and let $\mathcal{D} = [-1, 1]$. We consider a translation invariant covariance $G$, i.e.

$$G(x', x'') = G(x' - x'', 0) = G(x' - x''),$$

and assume that it has a spectral density $\hat{G}(t)$, $t \in \mathbb{R}$:

$$G(x) = \int_{\mathbb{R}} \hat{G}(t)\, e^{itx}\, dt, \qquad \hat{G}(t) = \frac{1}{2\pi} \int_{\mathbb{R}} G(x)\, e^{-itx}\, dx.$$

Since $G$ is real and even, such is $\hat{G}$: $\hat{G}(t) = \hat{G}(-t) \in \mathbb{R}$. We assume that $\hat{G}(t) > 0$ for all $t$, so that the kernel $G$ is strictly positive definite.

We start by showing, as a preparation for the main result, that if

$$\hat{G}(t) \le c_0\, e^{-c|t|}, \tag{7}$$

with some $c_0, c > 0$, then the process $\xi_x$ does not have the NEB property. Condition (7) implies, in particular, that $G$ is analytic in the strip $|\mathrm{Im}(z)| < c$.

Theorem 1. Let $(\xi_x)_{x \in [-1,1]}$ be a centered stationary Gaussian process defined on a probability space $(\Omega, P)$. Suppose that the spectral density $\hat{G}$ of the process satisfies condition (7) with some $c_0, c > 0$. Let $A$ be any infinite subset of the segment $[-1, 1]$. Then all the random variables $(\xi_x)_{x \in [-1,1]}$ belong to the closed linear span of the random variables $(\xi_y)_{y \in A}$ in $L^2(\Omega, P)$.

We next indicate a class of optimization problems where the EI optimization trajectory is provably not dense in $[-1, 1]$.

In proving this result, we have found especially useful a pair of asymptotic bounds for the conditional variance, which we will state now as a separate theorem.

Suppose that the spectral density is represented in the form

$$\hat{G}(t) = e^{-S(|t|)} = e^{-T(\ln |t|)} \tag{8}$$

with some functions $S \in C^2(\mathbb{R}_+)$, $T \in C^2(\mathbb{R})$. We will assume that

$$S'(t),\ S''(t) > 0 \quad \text{for } t > 0, \tag{9}$$

and

$$S'(t) \to +\infty \quad \text{as } t \to +\infty. \tag{10}$$

Condition (9) implies, in particular, that $T$ is convex, since

$$T''(s) = (S(e^s))'' = e^s S'(e^s) + e^{2s} S''(e^s) > 0. \tag{11}$$

Also,

$$T'(s) \to +\infty \quad \text{as } s \to +\infty, \tag{12}$$

since

$$T'(s) = e^s S'(e^s). \tag{13}$$

Let $T^*$ be the Legendre transform of $T$:

$$T^*(q) = \max_{s} \big(qs - T(s)\big).$$

Then, by (12), $T^*(q)$ is finite for all sufficiently large $q$; the point $s^*$ where the maximum is attained satisfies the condition $T'(s^*) = q$.

Theorem 2. Suppose that the spectral density of the covariance function $G$ is represented in the form (8) so that conditions (9), (10) hold. Then, for sufficiently large $K$, the following inequalities hold for any $K + 1$ different points $x, x_1, \ldots, x_K \in [-1, 1]$:

$$e^{-K} \le \frac{\sigma^2_{x;\{x_k\}_{k=1}^{K}}}{e^{F(K)} \prod_{k=1}^{K} |x - x_k|^2} \le e^{2K}, \tag{14}$$

where

$$F(K) = T^*(2K + 1) - (2K + 1) \ln K.$$

Furthermore, $F(K)$ monotonically decreases for sufficiently large $K$, and

$$\frac{F(K)}{K} \to -\infty \quad \text{as } K \to +\infty. \tag{15}$$

An example of a family of spectral densities covered by this theorem is

$$\hat{G}(t) = e^{-a|t|^b} \tag{16}$$

with $a > 0$, $b > 1$. In particular, with $b = 2$ this gives the Gaussian covariance functions

$$G(x) = \sqrt{\frac{\pi}{a}}\, e^{-\frac{x^2}{4a}}. \tag{17}$$

Spectral densities (16) correspond to

$$S(|t|) = a|t|^b, \qquad T(s) = a e^{bs},$$

so that conditions (9), (10) hold for all $a > 0$, $b > 1$. We find in this case

$$T^*(q) = \frac{q}{b} \left( \ln \frac{q}{ab} - 1 \right).$$
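The closed form of $T^*$ for the family (16) and the behaviour of $F(K)/K$ stated in (15) can be checked numerically. The sketch below is only a sanity check under assumed parameter values ($a = 0.25$, $b = 2$ are illustrative); the function names are hypothetical, and the maximization uses a simple ternary search, which is valid here because $qs - T(s)$ is concave in $s$.

```python
import math

def legendre_closed_form(q, a, b):
    """T*(q) for T(s) = a * exp(b * s): (q / b) * (ln(q / (a * b)) - 1)."""
    return (q / b) * (math.log(q / (a * b)) - 1.0)

def legendre_numeric(q, a, b, lo=-20.0, hi=20.0, iters=200):
    """Maximize q*s - a*exp(b*s) over s in [lo, hi] by ternary search."""
    g = lambda s: q * s - a * math.exp(b * s)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if g(m1) < g(m2):
            lo = m1
        else:
            hi = m2
    return g(0.5 * (lo + hi))

def F(K, a, b):
    """F(K) = T*(2K + 1) - (2K + 1) ln K, as in Theorem 2."""
    return legendre_closed_form(2 * K + 1, a, b) - (2 * K + 1) * math.log(K)

a, b = 0.25, 2.0  # illustrative values only
ratios = [F(K, a, b) / K for K in (10, 100, 1000)]
```

For these assumed parameters the ratios $F(K)/K$ decrease as $K$ grows, in line with (15); the decrease is only logarithmic in $K$ for $b = 2$.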

We state now our main result on the EI optimization. Recall that if $G$ is a positive definite kernel, then $G(0) = \max_{x \in \mathbb{R}} G(x)$.

Theorem 3. Under assumptions of Theorem 2, consider optimization problem (1) on $\mathcal{D} = [-1, 1]$ with the objective function

$$f = -G.$$

Suppose that the EI optimization with the kernel $G$ starts from the point $x_1 = 0$ (i.e., the point where the minimum is already attained). Then the optimization trajectory $\{x_k\}_{k=1}^{\infty}$ converges to 0; in particular the trajectory is not dense in $[-1, 1]$. Moreover, for sufficiently large $K$

$$e^{2^K F(K)} \le |x_{K+1}| \le e^{F(K)/3}, \tag{18}$$

where $F$ is as defined in Theorem 2.

As pointed out in Proposition 1, Theorem 3 implies that there are continuous objective functions for which the EI optimization is inconsistent. Such functions can be obtained by modifying the objective function $-G$ on a set not containing points of the corresponding trajectory $\{x_k\}_{k=1}^{\infty}$. Recall that under assumptions of Theorems 2 and 3 the kernel $G$ is analytic. We cannot modify $-G$ preserving its analyticity, but can modify it preserving its infinite smoothness. Recalling the family of examples discussed above, we obtain, in particular, the following corollary regarding the Gaussian covariance function.

Corollary 1. For any Gaussian covariance function (17), there exists an objective function $f \in C^{\infty}([-1, 1])$ such that the EI optimization of $f$ starting with $x_1 = 0$ is not consistent.

We briefly discuss now practical implications of these results.

On the one hand, there are certain caveats to their practical interpretation. First, we consider only the simplest version of the EI optimization in 1D, while real applications are mostly higher-dimensional. Second, realistic optimization budgets may be too low in many problems for the indicated asymptotic behavior to be relevant. Third, the theoretical consistency, as such, may in principle be restored by trivial adjustments of the algorithm, e.g., by occasionally alternating the EI trajectory with a fixed dense sequence in $\mathcal{D}$.

Nevertheless, our results suggest that, in general, EI algorithms with analytic kernels are not reliable beyond a narrow class of very smooth functions, or are at least hard to justify theoretically. Moreover, they may be prone to early ill-conditioning and numerical instabilities due to excessive accumulation of trajectory points (see also the numerical example below). It appears that for practical numerical optimization of generic objective functions, if EI is to be applied, then a more reliable choice for the covariance would be a rough kernel with an inverse polynomial falloff of the spectral density, for example from the Matern family (see, e.g., [13]).

In the next section we report a numerical test of Theorem 3. Then, in Sections 4-6 we provide the proofs of Theorems 1-3.



3 A numerical example

To confirm Theorem 3, we report direct numerical results of the EI optimization of the objective function $f(x) = -e^{-x^2}$ performed with the kernel $G(x) = e^{-x^2}$.

In short, our numerical procedure is as follows. At each iteration, for any trial point $x$ we compute the parameters $m_x, \sigma_x$ of the associated posterior Gaussian variable by explicitly using formulas (5), (6). We then compute the expected improvement at $x$ using the well-known formula (see [5])

$$I_K(x) = (f_K^* - m_x)\, \Phi\!\left(\frac{f_K^* - m_x}{\sigma_x}\right) + \sigma_x\, \varphi\!\left(\frac{f_K^* - m_x}{\sigma_x}\right),$$

where $\varphi$ and $\Phi$ are the standard normal density and cumulative distribution function, respectively. To optimize $I_K(x)$ over $x$, we simply sample $x$ uniformly on a logarithmic scale: precisely, we try $x = \pm e^{-\epsilon l}$, where $\epsilon = 0.02$ and $l = 0, 1, \ldots, 10^4$.
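The two ingredients of this procedure, the closed-form expected improvement and the logarithmic grid of trial points, can be sketched as follows. This is a double-precision illustration only, using the standard closed form of the expected improvement for a Gaussian posterior; the function names are hypothetical, and the sketch does not attempt to reproduce the extended-precision computation described in this section.

```python
import math

def expected_improvement(f_best, mean, sigma):
    """EI for minimization, posterior N(mean, sigma^2), incumbent value f_best."""
    if sigma <= 0.0:
        return max(f_best - mean, 0.0)
    u = (f_best - mean) / sigma
    pdf = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))
    return (f_best - mean) * cdf + sigma * pdf

def log_grid(eps=0.02, n=10**4):
    """Trial points x = +/- exp(-eps * l) for l = 0, ..., n."""
    pts = []
    for l in range(n + 1):
        x = math.exp(-eps * l)
        pts.extend((x, -x))
    return pts

# One observation at x_1 = 0 with f = -G and G(x) = exp(-x^2):
# posterior mean -exp(-x^2), posterior variance 1 - exp(-2 x^2).
x = -0.63
ei = expected_improvement(-1.0,
                          -math.exp(-x * x),
                          math.sqrt(1.0 - math.exp(-2.0 * x * x)))
grid = log_grid()
```

Under these assumptions, `ei` should come out close to the value 0.16 listed for the second trajectory point in Table 1. The smallest grid points are of order $e^{-200}$, which is still representable in double precision, but the posterior quantities at later iterations are not reliably computable without extended precision.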

We should point out that this numerical procedure is quite unstable for our kernel and objective function. As the posterior variance of the process rapidly converges to 0 and the trajectory $\{x_k\}$ to $x_1 = 0$, computation of the expected improvement involves, in several places, subtraction of almost equal quantities, in particular in (6). Also, the matrices $G_K$ quickly get ill-conditioned. As a result, precision of, for example, the usual "double" floating point format, which has the 53-bit significand (approximately 16 decimal digits), is exhausted very soon during this optimization. For this reason, we perform our test with the extended precision of 300 decimal digits, using the free library mpmath [12] for that purpose.

The first 10 elements of the trajectory appear then to be reasonably reliably computed, and are shown in Table 1, together with the corresponding expected improvements. This result confirms Theorem 3, also suggesting the actual asymptotic of $x_K$ is closer to the lower rather than upper bound in (18).



 K    x_K         Expected improvement
 1    0           -
 2    -0.63       0.16
 3    0.77        0.13
 4    0.23        0.025
 5    -0.1        0.0013
 6    0.0036      3.4e-06
 7    -7.3e-06    1.4e-11
 8    2.8e-11     2.2e-22
 9    -4.1e-22    4.5e-44
10    7.9e-44     1.7e-87

Table 1: The first 10 elements of the EI optimization trajectory with the respective expected improvements, for the kernel $G(x) = e^{-x^2}$ and the objective function $f = -G$.



4 Proof of Theorem 1 

Let the Hilbert space $\mathcal{H}$ be the closed span of the Gaussian random variables $(\xi_x)_{x \in [-1,1]}$ in $L^2(\Omega, P)$. We use the canonical isometry between $\mathcal{H}$ and $L^2(\mathbb{R}, \hat{G})$:

$$\xi_x \in \mathcal{H} \longleftrightarrow \phi_x \in L^2(\mathbb{R}, \hat{G}), \tag{19}$$

where

$$\phi_x(t) = e^{ixt},$$

so that

$$\langle \xi_x, \xi_{x'} \rangle_{L^2(\Omega, P)} = G(x - x') = \int_{\mathbb{R}} e^{itx} e^{-itx'} \hat{G}(t)\, dt = \langle \phi_x, \phi_{x'} \rangle_{L^2(\mathbb{R}, \hat{G})}.$$

In terms of this isometry, the claim of Theorem 1 is that for any $x \in [-1, 1]$ the function $\phi_x$ can be approximated in $L^2(\mathbb{R}, \hat{G})$ by finite linear combinations of functions $(\phi_y)_{y \in A}$. We first prove the following

Lemma 1. Let $x$ be any point in $[-1, 1]$, and $\{x_k\}_{k=1}^{\infty}$ any infinite sequence of points in $[-1, 1]$ such that $x_k \ne x_l$ for $k \ne l$ and $|x - x_k| < \frac{c}{4}$ for all $k$, where $c$ is from (7). Then, assuming (7), $\phi_x$ can be approximated in $L^2(\mathbb{R}, \hat{G})$ with arbitrary accuracy by finite linear combinations of $\phi_{x_k}$.

Proof. By the theory of polynomial interpolation (see, e.g., [8]), for any positive integer $K$ we can choose coefficients $\lambda_{K,1}, \ldots, \lambda_{K,K}$ such that for any polynomial $p$ with $\deg p < K$ we have

$$p(x) = \sum_{k=1}^{K} \lambda_{K,k}\, p(x_k), \tag{20}$$

namely,

$$\lambda_{K,k} = \prod_{\substack{l=1 \\ l \ne k}}^{K} \frac{x - x_l}{x_k - x_l}.$$

The r.h.s. of (20) is a polynomial in $x$ of degree $< K$. If a function $p(x)$ is not a polynomial of degree $< K$, then the difference between the left and right sides of (20) can be interpreted as the error of polynomial interpolation and written in terms of divided differences of $p$:

$$p(x) - \sum_{k=1}^{K} \lambda_{K,k}\, p(x_k) = p[x, x_1, \ldots, x_K] \prod_{k=1}^{K} (x - x_k).$$

By the Hermite-Genocchi formula,

$$\big| p[x, x_1, \ldots, x_K] \big| \le \frac{1}{K!} \left\| \frac{d^K p}{dx^K} \right\|_{\mathrm{conv}(x, x_1, \ldots, x_K)},$$

where $\| \cdot \|_{\mathrm{conv}(x, x_1, \ldots, x_K)}$ denotes the maximum over the convex hull of the points $x, x_1, \ldots, x_K$. In particular, if $p(x) = e^{ixt}$ with some $t \in \mathbb{R}$, then

$$\big| p[x, x_1, \ldots, x_K] \big| \le \frac{|t|^K}{K!}.$$

Accordingly,

$$\left| e^{ixt} - \sum_{k=1}^{K} \lambda_{K,k}\, e^{i x_k t} \right| \le \frac{|t|^K}{K!} \prod_{k=1}^{K} |x - x_k|,$$

and hence

$$\left\| \phi_x - \sum_{k=1}^{K} \lambda_{K,k}\, \phi_{x_k} \right\|^2_{L^2(\mathbb{R}, \hat{G})} \le \frac{\prod_{k=1}^{K} |x - x_k|^2}{(K!)^2} \int_{\mathbb{R}} t^{2K} \hat{G}(t)\, dt. \tag{21}$$

Now recall that $\hat{G}(t) \le c_0 e^{-c|t|}$ with some $c_0, c > 0$, and $|x - x_k| < \frac{c}{4}$. Then

$$\left\| \phi_x - \sum_{k=1}^{K} \lambda_{K,k}\, \phi_{x_k} \right\|^2_{L^2(\mathbb{R}, \hat{G})} \le \frac{2 c_0}{c^{2K+1} (K!)^2} \left( \frac{c}{4} \right)^{2K} \int_0^{+\infty} s^{2K} e^{-s}\, ds = \frac{2 c_0\, (2K)!}{c\, 4^{2K} (K!)^2} = O(2^{-2K}) \xrightarrow{K \to \infty} 0,$$

where we used Stirling's formula in the last step.

□
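The pointwise interpolation bound used in this proof, $|e^{ixt} - \sum_k \lambda_{K,k} e^{i x_k t}| \le \frac{|t|^K}{K!} \prod_k |x - x_k|$, is easy to test numerically. The following sketch does so for an arbitrary, purely illustrative choice of nodes; the function names are hypothetical.

```python
import cmath
import math

def lagrange_weights(x, nodes):
    """lambda_{K,k} = prod over l != k of (x - x_l) / (x_k - x_l)."""
    weights = []
    for k, xk in enumerate(nodes):
        w = 1.0
        for l, xl in enumerate(nodes):
            if l != k:
                w *= (x - xl) / (xk - xl)
        weights.append(w)
    return weights

def interpolation_error(x, nodes, t):
    """|exp(i x t) - sum_k lambda_{K,k} exp(i x_k t)|."""
    lam = lagrange_weights(x, nodes)
    approx = sum(w * cmath.exp(1j * xk * t) for w, xk in zip(lam, nodes))
    return abs(cmath.exp(1j * x * t) - approx)

def error_bound(x, nodes, t):
    """|t|^K / K! * prod_k |x - x_k|."""
    K = len(nodes)
    return abs(t) ** K / math.factorial(K) * math.prod(abs(x - xk) for xk in nodes)

nodes = [0.1 * k for k in range(1, 7)]  # illustrative nodes in [-1, 1]
checks = [(interpolation_error(0.0, nodes, t), error_bound(0.0, nodes, t))
          for t in (1.0, 3.0, 10.0)]
```

Each pair in `checks` holds the observed error and the corresponding bound for one frequency $t$. The bound grows like $|t|^K$, which is why the rapid decay of $\hat{G}$ in (7) is needed to control the integral in (21).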

Now, let $B$ denote the set of all those points $x$ in $[-1, 1]$ for which $\phi_x$ can be approximated by linear combinations of $(\phi_y)_{y \in A}$. We prove that $B = [-1, 1]$ in several steps.

Statement 1. $B$ has a non-empty interior.

Indeed, since $A \subset [-1, 1]$ is infinite, we can find an interval of length $\frac{c}{4}$ in $[-1, 1]$ that contains infinitely many points of $A$. Then, by the above lemma, any point of this interval belongs to $B$.

Statement 2. If $x$ belongs to the interior of $B$ and $|x' - x| < \frac{c}{4}$ for some $x' \in [-1, 1]$, then $x'$ also belongs to the interior of $B$.

Indeed, it follows from the hypothesis that we can find infinitely many distinct points $x_k$ in $B$ such that $|x' - x_k| < \frac{c}{4}$ for all of them. By the above lemma, $\phi_{x'}$ can then be approximated by finite linear combinations of $\phi_{x_k}$. But, since $x_k \in B$, any $\phi_{x_k}$ can in turn be approximated by finite linear combinations of $(\phi_y)_{y \in A}$. It follows that $\phi_{x'}$ can be approximated by finite linear combinations of $(\phi_y)_{y \in A}$, i.e., $x' \in B$.

Now, in the above argument, $x'$ could be replaced by any $x''$ sufficiently close to $x'$ so that $|x'' - x| < \frac{c}{4}$ still holds, and we would get $x'' \in B$. It follows that $x'$ belongs not only to $B$, but even to the interior of $B$.

Statement 3. $B = [-1, 1]$.

This follows immediately from Statements 1 and 2.

This completes the proof of Theorem 1.

5 Proof of Theorem 2 

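We use again the canonical isometry (19) to express the conditional variance $\sigma^2_{x;\{x_k\}_{k=1}^{K}}$ as

$$\sigma^2_{x;\{x_k\}_{k=1}^{K}} = \min_{\lambda_1, \ldots, \lambda_K} \int_{\mathbb{R}} \left| e^{ixt} - \sum_{k=1}^{K} \lambda_k e^{i x_k t} \right|^2 \hat{G}(t)\, dt.$$

We start now with the proof of the upper bound in (14). We already know from the proof of Theorem 1 that (see (21))

$$\sigma^2_{x;\{x_k\}_{k=1}^{K}} \le \frac{\prod_{k=1}^{K} |x - x_k|^2}{(K!)^2} \int_{\mathbb{R}} t^{2K} \hat{G}(t)\, dt. \tag{22}$$

We substitute $t = \pm e^s$ in the integral on the r.h.s.:

$$\int_{\mathbb{R}} t^{2K} \hat{G}(t)\, dt = 2 \int_{\mathbb{R}} e^{(2K+1)s - T(s)}\, ds. \tag{23}$$

We can now derive an upper bound for this integral using a basic form of the Laplace method. Consider the function $\widetilde{T}_K(s) := (2K + 1)s - T(s)$ which is concave by (11). Let

$$s_K^* = \arg\max_{s} \widetilde{T}_K(s). \tag{24}$$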
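
Using $\widetilde{T}_K'(s_K^*) = 0$ and $\widetilde{T}_K'' = -T''$, we can write

$$\begin{aligned}
\widetilde{T}_K(s) &= \widetilde{T}_K(s_K^*) + \widetilde{T}_K'(s_K^*)(s - s_K^*) + \int_{s_K^*}^{s} \left( \int_{s_K^*}^{s_1} \widetilde{T}_K''(s_2)\, ds_2 \right) ds_1 \\
&= \widetilde{T}_K(s_K^*) - \int_{s_K^*}^{s} \left( \int_{s_K^*}^{s_1} T''(s_2)\, ds_2 \right) ds_1 \\
&\le \widetilde{T}_K(s_K^*) - \int_{s_K^*}^{s} \left( \int_{s_K^*}^{s_1} \chi''(s_2 - s_K^*)\, ds_2 \right) ds_1 \\
&= \widetilde{T}_K(s_K^*) - \chi(s - s_K^*) \\
&= T^*(2K + 1) - \chi(s - s_K^*), \quad \text{for all } s \in \mathbb{R},
\end{aligned} \tag{25}$$

for any $C^2$ function $\chi$ such that $\chi(0) = \chi'(0) = 0$ and

$$\chi''(s) \le T''(s_K^* + s) \quad \text{for all } s. \tag{26}$$

It follows then from (25) that

$$\int_{\mathbb{R}} e^{(2K+1)s - T(s)}\, ds \le c_\chi\, e^{T^*(2K+1)},$$

where $c_\chi = \int_{\mathbb{R}} e^{-\chi(s)}\, ds$. By (10), (11), $T''(s) \to +\infty$ as $s \to +\infty$, so we can choose a $\chi$ such that $c_\chi = 1$ while (26) and hence (25) hold for all sufficiently large $K$; for example

$$\chi(s) = c_1 \cdot \begin{cases} 6s^2 - s^4, & |s| \le 1, \\ 8|s| - 3, & |s| > 1, \end{cases}$$

with the appropriate constant $c_1$. Using Stirling's formula, we then get from (22), (23), for sufficiently large $K$,

$$\sigma^2_{x;\{x_k\}_{k=1}^{K}} \le e^{F(K) + 2K} \prod_{k=1}^{K} |x - x_k|^2,$$

which is the upper bound in (14).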

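To prove the lower bound in (14), we will use the following lemma.

Lemma 2. Let $z \in \mathbb{C}$ and $\{z_k\}_{k=1}^{K} \subset \mathbb{C}$. Let

$$v = (1, z, z^2, \ldots, z^K) \in \mathbb{C}^{K+1}.$$

Similarly, let

$$v_k = (1, z_k, z_k^2, \ldots, z_k^K) \in \mathbb{C}^{K+1}, \quad k = 1, \ldots, K.$$

Then the standard $l^2$ distance $\rho$ in $\mathbb{C}^{K+1}$ between $v$ and the linear span of $\{v_k\}_{k=1}^{K}$ equals

$$\rho = \frac{\prod_{k=1}^{K} |z - z_k|}{\left( 1 + \sum_{q=1}^{K} \left| \sum_{1 \le k_1 < \ldots < k_q \le K}\ \prod_{t=1}^{q} z_{k_t} \right|^2 \right)^{1/2}}. \tag{27}$$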
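Formula (27) can be cross-checked against the ratio of Gram determinants, which gives the squared distance from a vector to a linear span. The sketch below does this for one small, arbitrarily chosen configuration; all function names and the sample points are assumptions made for illustration, and the cofactor-expansion determinant is only suitable for very small $K$.

```python
import itertools
import math

def det(M):
    """Determinant by cofactor expansion (fine for the tiny matrices used here)."""
    n = len(M)
    if n == 1:
        return M[0][0]
    total = 0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += (-1) ** j * M[0][j] * det(minor)
    return total

def gram(vectors):
    """Gram determinant with respect to the Hermitian inner product."""
    G = [[sum(a * b.conjugate() for a, b in zip(u, w)) for w in vectors]
         for u in vectors]
    return det(G).real

def distance_via_gram(z, zs):
    """Distance from v to span{v_k}, computed as sqrt of a Gram determinant ratio."""
    K = len(zs)
    v = [z ** p for p in range(K + 1)]
    vs = [[zk ** p for p in range(K + 1)] for zk in zs]
    return math.sqrt(gram([v] + vs) / gram(vs))

def distance_via_formula(z, zs):
    """Right-hand side of (27), using elementary symmetric polynomials of the z_k."""
    K = len(zs)
    num = math.prod(abs(z - zk) for zk in zs)
    den = 1.0
    for q in range(1, K + 1):
        e_q = sum(math.prod(c) for c in itertools.combinations(zs, q))
        den += abs(e_q) ** 2
    return num / math.sqrt(den)

z = 0.3 + 0.4j
zs = [1 + 0j, 1j, -0.5 + 0.2j]
d_gram = distance_via_gram(z, zs)
d_formula = distance_via_formula(z, zs)
```

The two values should agree up to rounding. The inner sums in (27) are the elementary symmetric polynomials of $z_1, \ldots, z_K$, which is how `distance_via_formula` evaluates them.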
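
Proof. We have

$$\rho^2 = \frac{g(v, v_1, \ldots, v_K)}{g(v_1, \ldots, v_K)}, \tag{28}$$

where $g(\cdot)$ denotes the Gram determinant of the given system of vectors. Since $v_k$ and $v$ are $(K + 1)$-dimensional, $g(v, v_1, \ldots, v_K)$ can be computed simply from the Vandermonde determinant for $z, z_1, \ldots, z_K$:

$$g(v, v_1, \ldots, v_K) = \prod_{0 \le k < l \le K} |z_k - z_l|^2, \tag{29}$$

where we have denoted $z_0 = z$. In order to compute $g(v_1, \ldots, v_K)$, we note first that it can be expressed, by the Cauchy-Binet formula, as

$$g(v_1, \ldots, v_K) = \sum_{s=0}^{K} |\Delta_{K,s}|^2, \tag{30}$$

where $\Delta_{K,s}$ is the $K \times K$ minor of the $K \times (K + 1)$ matrix $(z_k^t)_{k=1,\ldots,K;\ t=0,\ldots,K}$ obtained by removing the column $(z_k^s)_{k=1}^{K}$. Note that $\Delta_{K,K}$ is the usual Vandermonde determinant, and $\Delta_{K,0} = \Delta_{K,K} \prod_{k=1}^{K} z_k$. We can compute $\Delta_{K,s}$ for any $s$ in a way similar to the usual inductive evaluation of the Vandermonde determinant. Namely, define for brevity

$$\mu_s(t) = \begin{cases} t, & t < s, \\ t + 1, & t \ge s. \end{cases}$$

Then for $0 < s < K$ we have, performing linear transformations with rows and columns,

$$\begin{aligned}
\Delta_{K,s} &= \det \big( z_k^{\mu_s(t)} \big)_{\substack{1 \le k \le K \\ 0 \le t \le K-1}} \\
&= \det \begin{pmatrix} z_k^{\mu_s(t)} - z_K^{\mu_s(t)}, & k < K \\ z_K^{\mu_s(t)}, & k = K \end{pmatrix}_{\substack{1 \le k \le K \\ 0 \le t \le K-1}} \\
&= (-1)^{K-1} \det \big( z_k^{\mu_s(t)} - z_K^{\mu_s(t)} \big)_{\substack{1 \le k \le K-1 \\ 1 \le t \le K-1}} \\
&= \det \left( \sum_{r=0}^{\mu_s(t)-1} z_k^{r} z_K^{\mu_s(t)-1-r} \right)_{\substack{1 \le k \le K-1 \\ 1 \le t \le K-1}} \prod_{k=1}^{K-1} (z_K - z_k) \\
&= \det \left( \sum_{r=\mu_s(t-1)}^{\mu_s(t)-1} z_k^{r} z_K^{\mu_s(t)-1-r} \right)_{\substack{1 \le k \le K-1 \\ 1 \le t \le K-1}} \prod_{k=1}^{K-1} (z_K - z_k) \\
&= \det \begin{pmatrix} z_k^{\mu_s(t)-1}, & t \ne s \\ z_K z_k^{s-1} + z_k^{s}, & t = s \end{pmatrix}_{\substack{1 \le k \le K-1 \\ 1 \le t \le K-1}} \prod_{k=1}^{K-1} (z_K - z_k) \\
&= \big( z_K \Delta_{K-1,s} + \Delta_{K-1,s-1} \big) \prod_{k=1}^{K-1} (z_K - z_k).
\end{aligned} \tag{31}$$

Similar identities hold if $s = 0$ or $s = K$, but with one of the terms $\Delta_{K-1,s-1}$, $z_K \Delta_{K-1,s}$ missing:

$$\Delta_{K,0} = z_K \Delta_{K-1,0} \prod_{k=1}^{K-1} (z_K - z_k), \qquad \Delta_{K,K} = \Delta_{K-1,K-1} \prod_{k=1}^{K-1} (z_K - z_k). \tag{32}$$

Iterating identities (31), (32) $K$ times, we get

$$\Delta_{K,s} = \prod_{1 \le k < l \le K} (z_l - z_k) \sum_{1 \le k_1 < \ldots < k_{K-s} \le K}\ \prod_{t=1}^{K-s} z_{k_t}.$$

Substituting this equality in (30) and combining with (28) and (29), we get (27) with $q = K - s$. □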

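To derive now the lower bound in (14), fix a $t_0 = t_0(K) > 0$, to be chosen later. Using monotonicity of $\hat{G}(t)$ for $t > 0$, which follows from (9), we write:

$$\begin{aligned}
\sigma^2_{x;\{x_k\}_{k=1}^{K}} &= \min_{\lambda_1, \ldots, \lambda_K} \int_{\mathbb{R}} \left| e^{ixt} - \sum_{k=1}^{K} \lambda_k e^{i x_k t} \right|^2 \hat{G}(t)\, dt \\
&\ge \hat{G}\!\left( \tfrac{(K+1) t_0}{2} \right) \min_{\lambda_1, \ldots, \lambda_K} \int_{-\frac{(K+1) t_0}{2}}^{\frac{(K+1) t_0}{2}} \left| e^{ixt} - \sum_{k=1}^{K} \lambda_k e^{i x_k t} \right|^2 dt \\
&= \hat{G}\!\left( \tfrac{(K+1) t_0}{2} \right) \min_{\lambda_1, \ldots, \lambda_K} \int_{-\frac{(K+1) t_0}{2}}^{-\frac{(K-1) t_0}{2}} \sum_{l=0}^{K} \left| e^{ixt} e^{ilxt_0} - \sum_{k=1}^{K} \lambda_k e^{i x_k t} e^{il x_k t_0} \right|^2 dt \\
&\ge \hat{G}\!\left( \tfrac{(K+1) t_0}{2} \right) \int_{-\frac{(K+1) t_0}{2}}^{-\frac{(K-1) t_0}{2}} \min_{\lambda_1, \ldots, \lambda_K} \sum_{l=0}^{K} \left| e^{ilxt_0} - \sum_{k=1}^{K} \lambda_k e^{i(x_k - x)t} e^{il x_k t_0} \right|^2 dt \\
&= \hat{G}\!\left( \tfrac{(K+1) t_0}{2} \right) t_0 \min_{\lambda_1, \ldots, \lambda_K} \sum_{l=0}^{K} \left| e^{ilxt_0} - \sum_{k=1}^{K} \lambda_k e^{il x_k t_0} \right|^2 \\
&= \hat{G}\!\left( \tfrac{(K+1) t_0}{2} \right) t_0\, \rho^2,
\end{aligned}$$

where $\rho$ is the distance defined as in Lemma 2 for $z = e^{ixt_0}$, $z_k = e^{i x_k t_0}$. Since $|z| = |z_k| = 1$, the denominator in (27) is bounded from above by $2^K$. Therefore,

$$\sigma^2_{x;\{x_k\}_{k=1}^{K}} \ge \hat{G}\!\left( \tfrac{(K+1) t_0}{2} \right) \frac{t_0}{2^{2K}} \prod_{k=1}^{K} \left| e^{ixt_0} - e^{i x_k t_0} \right|^2.$$

Let us assume that

$$t_0 \le \frac{\pi}{2}. \tag{33}$$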

In this case, since $x, x_k \in [-1, 1]$, we have $\frac{|x - x_k| t_0}{2} \le \frac{\pi}{2}$, hence

$$\left| e^{ixt_0} - e^{i x_k t_0} \right| = 2 \sin \frac{|x - x_k| t_0}{2} \ge \frac{4}{\pi} \cdot \frac{|x - x_k| t_0}{2} = \frac{2 |x - x_k| t_0}{\pi}.$$

Therefore, assuming (33),

$$\sigma^2_{x;\{x_k\}_{k=1}^{K}} \ge \hat{G}\!\left( \tfrac{(K+1) t_0}{2} \right) \frac{t_0^{2K+1}}{\pi^{2K}} \prod_{k=1}^{K} |x - x_k|^2. \tag{34}$$

Now let us choose $t_0$ so that

$$\frac{(K+1) t_0}{2} = e^{s_K^*},$$

where $s_K^*$ is given by (24). Then, if (33) holds, we get from (34)

$$\sigma^2_{x;\{x_k\}_{k=1}^{K}} \ge \frac{2^{2K+1}}{\pi^{2K}}\, e^{T^*(2K+1) - (2K+1) \ln (K+1)} \prod_{k=1}^{K} |x - x_k|^2.$$

This implies the lower bound in (14), since $\frac{4}{\pi^2} > \frac{1}{e}$. We have to check, however, that condition (33) is fulfilled. The value $s_K^*$ satisfies the condition $2K + 1 = T'(s_K^*) = e^{s_K^*} S'(e^{s_K^*})$. Since $S'(t) \to +\infty$ as $t \to +\infty$, it follows that

$$e^{s_K^*} = o(2K + 1) \quad \text{as } K \to \infty. \tag{35}$$

Therefore $t_0 \to 0$ as $K \to \infty$, so (33) is fulfilled for sufficiently large $K$.
We now prove (15). Since $T(s_K^*) > 0$ for sufficiently large $K$, we have

$$\frac{F(K)}{2K + 1} = \frac{T^*(2K + 1) - (2K + 1) \ln K}{2K + 1} = s_K^* - \frac{T(s_K^*)}{2K + 1} - \ln K \le \ln \frac{e^{s_K^*}}{K} \to -\infty,$$

where we used (35) in the last step.

It remains to prove that $F(K)$ monotonically decreases for sufficiently large $K$. We want to show that

$$\frac{dF(K)}{dK} = 2 (T^*)'(2K + 1) - 2 \ln K - \frac{2K + 1}{K} < 0.$$

It suffices to show that

$$(T^*)'(2K + 1) - \ln (2K + 1) \to -\infty, \quad \text{as } K \to +\infty.$$

By duality of the Legendre transform, this is equivalent to

$$s - \ln (T'(s)) \to -\infty, \quad \text{as } s \to +\infty,$$

which follows from (13), (10).
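
6 Proof of Theorem 3

In this section, $\{x_k\}_{k=1}^{\infty}$ denotes the optimization trajectory obtained by (2), (3) with $x_1 = 0$.

We start proving Theorem 3 by first noting that, under the hypotheses of the theorem, the conditional mean of the objective function $f = -G$ at each point of $[-1, 1]$ is exactly equal to its actual value, throughout the whole optimization process:

$$m_{x;\{x_k, f(x_k)\}_{k=1}^{K}} = f(x), \quad \forall x \in [-1, 1].$$

Indeed, by (5), $m_{x;\{x_k, f(x_k)\}_{k=1}^{K}}$ is the unique interpolant of the function $f$ at the points $x_1, \ldots, x_K$ having the form $\sum_{k=1}^{K} \lambda_k G(x - x_k)$ with some coefficients $\lambda_k$. But $f(x) = -G(x - x_1)$ is of this form, so it is equal to the interpolant.

Since $f$ attains its minimum at $x_1 = 0$, we have $f_K^* = f^* = -G(0)$ for all $K$, hence the expected improvement can be written as

$$\begin{aligned}
I_{K;\{x_k, f(x_k)\}_{k=1}^{K}}(x) &= E\big( -G(0) - \min(-G(0), \xi_x) \,\big|\, \{\xi_{x_k} = f(x_k)\}_{k=1}^{K} \big) \\
&= \frac{1}{\sqrt{2\pi \sigma^2_{x;\{x_k\}_{k=1}^{K}}}} \int_{-\infty}^{-G(0)} (-G(0) - t) \exp\left( -\frac{(t + G(x))^2}{2 \sigma^2_{x;\{x_k\}_{k=1}^{K}}} \right) dt \\
&= \frac{\sigma_{x;\{x_k\}_{k=1}^{K}}}{\sqrt{2\pi}} \int_0^{\infty} \exp\left( -\frac{1}{2} \left( w + \frac{G(0) - G(x)}{\sigma_{x;\{x_k\}_{k=1}^{K}}} \right)^2 \right) w\, dw.
\end{aligned} \tag{36}$$

Lemma 3. For any $h \ge 0$

$$\frac{1}{2} e^{-h^2} \le \int_0^{\infty} e^{-\frac{(w+h)^2}{2}}\, w\, dw \le e^{-\frac{h^2}{2}}.$$

Proof. On the one hand,

$$\int_0^{\infty} e^{-\frac{(w+h)^2}{2}}\, w\, dw \le \int_0^{\infty} e^{-\frac{w^2}{2}} e^{-\frac{h^2}{2}}\, w\, dw = e^{-\frac{h^2}{2}},$$

where we have used $hw \ge 0$. On the other hand,

$$\int_0^{\infty} e^{-\frac{(w+h)^2}{2}}\, w\, dw \ge \int_0^{\infty} e^{-w^2} e^{-h^2}\, w\, dw = \frac{1}{2} e^{-h^2},$$

where we have used $hw \le \frac{h^2 + w^2}{2}$. □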
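The two-sided bound of Lemma 3, $\frac{1}{2} e^{-h^2} \le \int_0^\infty e^{-(w+h)^2/2}\, w\, dw \le e^{-h^2/2}$, can be verified by straightforward quadrature. The sketch below uses a midpoint rule on a truncated interval; the truncation point, the number of nodes and the function name are arbitrary choices made for this illustration.

```python
import math

def lemma3_integral(h, upper=40.0, n=100_000):
    """Midpoint-rule approximation of the integral of w * exp(-(w + h)^2 / 2) on [0, upper]."""
    dw = upper / n
    total = 0.0
    for i in range(n):
        w = (i + 0.5) * dw
        total += w * math.exp(-0.5 * (w + h) ** 2)
    return total * dw

results = {h: lemma3_integral(h) for h in (0.5, 1.0, 3.0)}
```

For each tested $h$ the computed value lies between the two bounds. At $h = 0$ the upper bound is attained exactly (the integral equals 1), so that case is left out of the numerical comparison to avoid a tie decided by rounding.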
In the sequel, we shorten the notation for the expected improvement to $I_K(x)$.

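Lemma 4. Under the assumptions of Theorem 3, there exist constants $c_1, c_2 > 0$, depending only on the kernel $G$, such that the following assertions hold for $K$ large enough.

1. For all $x \in [-1, 1]$

$$I_K(x) \ge \exp\left\{ -c_1 e^{K - F(K)} \frac{x^2}{\prod_{k=2}^{K} |x_k - x|^2} \right\} \frac{e^{(F(K) - K)/2}}{2\sqrt{2\pi}}\, |x| \prod_{k=2}^{K} |x_k - x|, \tag{37}$$

$$I_K(x) \le \exp\left\{ -c_2 e^{-2K - F(K)} \frac{x^2}{\prod_{k=2}^{K} |x_k - x|^2} \right\} \frac{e^{F(K)/2 + K}}{\sqrt{2\pi}}\, |x| \prod_{k=2}^{K} |x_k - x|. \tag{38}$$

2. If, additionally,

$$|x| \le \frac{1}{2} \min_{k=2,\ldots,K} |x_k|, \tag{39}$$

then

$$I_K(x) \ge \exp\left\{ -c_1 (4e)^K e^{-F(K)} \frac{x^2}{\prod_{k=2}^{K} |x_k|^2} \right\} \frac{e^{F(K)/2} (4e)^{-K/2}}{\sqrt{2\pi}}\, |x| \prod_{k=2}^{K} |x_k|, \tag{40}$$

$$I_K(x) \le \exp\left\{ -c_2 \left( \frac{4}{9} \right)^{K} e^{-2K - F(K)} \frac{x^2}{\prod_{k=2}^{K} |x_k|^2} \right\} \frac{e^{F(K)/2 + K}}{\sqrt{2\pi}} \left( \frac{3}{2} \right)^{K} |x| \prod_{k=2}^{K} |x_k|. \tag{41}$$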

Proof.

1. Since $G$ is strictly positive definite, we have $G(0) > G(x)$ for all $x \ne 0$, and hence there exist constants $c_1', c_2' > 0$ such that

$$c_1' x^2 \le G(0) - G(x) \le c_2' x^2, \quad \text{for all } x \in [-1, 1].$$

Using Theorem 2, identity (36) and Lemma 3, we then get (37), (38) with $c_1 = (c_2')^2$, $c_2 = (c_1')^2 / 2$. Note that the $k = 1$ factor is not present in the products over $k$ in the exponentials, as it equals $x^2$ and has been cancelled with $x^2$ in the numerator.

2. From the inequalities

$$\frac{1}{2} |x_k| \le |x_k - x| \le \frac{3}{2} |x_k|, \quad k = 2, \ldots, K,$$

we obtain

$$\left( \frac{1}{2} \right)^{K-1} \prod_{k=2}^{K} |x_k| \le \prod_{k=2}^{K} |x_k - x| \le \left( \frac{3}{2} \right)^{K-1} \prod_{k=2}^{K} |x_k|,$$

and then substitute these latter inequalities in (37), (38).

□



Lemma 5. Under the assumptions of Theorem 3, for all sufficiently large $K$:

a)

$$I_K(x_{K+1}) \ge e^{2F(K)} \prod_{k=2}^{K} |x_k|^2; \tag{42}$$

b)

$$|x_{K+1}| \ge e^{2F(K)} \prod_{k=2}^{K} |x_k|; \tag{43}$$

c)

$$|x_{K+1}| \ge e^{2^K F(K)}, \tag{44}$$

$$I_K(x_{K+1}) \ge e^{2^K F(K)}; \tag{45}$$

d)

$$|x_{K+1}| \le e^{F(K)/3}. \tag{46}$$



Proof.

a) Given $K$, let us choose $x \in [-1, 1]$ so as to make the expression in braces in (40) equal to $-1$, i.e.,

$$|x| = c_1^{-1/2} (4e)^{-K/2} e^{F(K)/2} \prod_{k=2}^{K} |x_k|.$$

By Theorem 2, $F(K)/K \to -\infty$, so condition (39) holds if $K$ is large enough, and we can apply inequality (40):

$$I_K(x) \ge c_3 (4e)^{-K} e^{F(K)} \prod_{k=2}^{K} |x_k|^2 \tag{47}$$

with some constant $c_3$ depending on $G$. By definition of $x_{K+1}$, $I_K(x_{K+1}) \ge I_K(x)$. Finally, using again $F(K)/K \to -\infty$, we arrive at (42).

b) Suppose that (43) is violated for infinitely many $K \in \mathbb{N}$. Then, for sufficiently large such $K$, the value $x_{K+1}$ satisfies condition (39) with $x = x_{K+1}$, and we can apply bound (41). It follows that for such $K$

$$I_K(x_{K+1}) \le \frac{e^{5F(K)/2} \left( \frac{3e}{2} \right)^{K}}{\sqrt{2\pi}} \prod_{k=2}^{K} |x_k|^2 = o(I_K(x_{K+1})),$$

where in the last equality we have used (42) and that $F(K)/K \to -\infty$. Therefore the hypothesis that (43) is violated for infinitely many $K$ is false.

c) To show (44), we continue inequality (43) by iteratively applying it to $x_{k+1}$ with $k = K, K - 1, \ldots, K_0$, where $K_0$ is the lowest value for which it is valid:

$$\begin{aligned}
|x_{K+1}| &\ge e^{2F(K)} \prod_{k=2}^{K} |x_k| \\
&\ge e^{2F(K) + 2F(K-1)} \prod_{k=2}^{K-1} |x_k|^2 \\
&\ge e^{2F(K) + 2F(K-1) + 4F(K-2)} \prod_{k=2}^{K-2} |x_k|^4 \\
&\ge \ldots \\
&\ge e^{2F(K) + \sum_{k=K_0}^{K-1} 2^{K-k} F(k)} \prod_{k=2}^{K_0} |x_k|^{2^{K-K_0}} \\
&\ge e^{2^{K-K_0+1} F(K)} \left( \prod_{k=2}^{K_0} |x_k| \right)^{2^{K-K_0}} \\
&= e^{2^K F(K)} \exp\left\{ 2^K \left[ (2^{-K_0+1} - 1) F(K) + \ln \prod_{k=2}^{K_0} |x_k|^{2^{-K_0}} \right] \right\},
\end{aligned}$$

where we assumed without loss of generality that $K_0 > 2$ and that monotonicity of $F(k)$ established in Theorem 2 holds for $k \ge K_0$. The second exponential factor in the last expression is greater than 1 for sufficiently large $K$ due to $F(K) \to -\infty$, which implies (44).

Inequality (45) is proved in the same way using (42) instead of (43) in the first step.

d) Suppose that (46) is violated for infinitely many $K \in \mathbb{N}$. First, observe that for sufficiently large such $K$ the bound (38), when applied to $x = x_{K+1}$, implies

$$I_K(x_{K+1}) \le \exp\{ -e^{-F(K)/4} \}. \tag{48}$$

Indeed, consider the first, exponential factor in (38). Using $|x_{K+1}| > e^{F(K)/3}$, the inequalities $|x_k - x_{K+1}| \le 2$, and $F(K)/K \to -\infty$, we can write:

$$-c_2 e^{-2K - F(K)} \frac{x_{K+1}^2}{\prod_{k=2}^{K} |x_k - x_{K+1}|^2} \le -c_2 (2e)^{-2K} e^{-F(K)/3} \le -e^{-F(K)/4}$$

for $K$ large enough. As for the remaining factor,

$$\frac{e^{F(K)/2 + K}}{\sqrt{2\pi}}\, |x_{K+1}| \prod_{k=2}^{K} |x_k - x_{K+1}|,$$

it is bounded by 1 for large $K$, again due to $F(K)/K \to -\infty$. We thus conclude (48). Now, combining (48) with (45), we see that

$$e^{2^K F(K)} \le I_K(x_{K+1}) \le \exp\{ -e^{-F(K)/4} \}.$$

This implies $2^K (-F(K)) \ge e^{-F(K)/4}$. Since $c \le e^{c/8}$ for sufficiently large $c$, we get $2^K \ge e^{-F(K)/8}$. But this inequality is violated for all $K$ large enough, since $F(K)/K \to -\infty$. Therefore our assumption that (46) is violated for infinitely many $K \in \mathbb{N}$ was wrong.

□

Inequalities (44) and (46) form the statement (18) of Theorem 3. Since $F(K)/K \to -\infty$, from (46) we conclude $|x_K| \to 0$, which completes the proof.